Large data set negative information storage model

ABSTRACT

Systems and methods for storing large data sets, such as genetic sequence information. Within a “targeted subset” of positions with information, the system stores both variant states and missing states at each position. Reference states are not stored, but are inferred within the targeted subset when neither a variant nor a missing state is stored at a given position. The absence of a variant state at a given position is assumed to be a reference state. The criteria for missing data are defined in pre-processing and are customizable based on the use case. For example, each data point may represent the genetic information of a sample at a position in the genome. The targeted subset may represent those positions that were included in a sequencing test.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/751,955, filed Feb. 12, 2018, entitled “Large Data Set Negative Information Storage Model,” which is a national stage application under 35 U.S.C. § 371 of PCT/IB2016/054868, filed Aug. 12, 2016, which claims priority to U.S. Provisional Patent Application No. 62/204,526, filed Aug. 13, 2015. The disclosures of the above-referenced applications are incorporated herein by reference in their respective entireties.

BACKGROUND

Genomic information is an important molecular marker for human diseases, including acquired diseases such as cancer and both rare and common inherited diseases such as Cystic Fibrosis and hypertension. Current data storage strategies associated with genomic information focus on storing all known information. These strategies may result in a need to store up to 3-6 billion pieces of information per individual. FIG. 1 illustrates a small-scale example in which 77 elements are stored; a real-world example would be much larger. In particular, for each position, either a reference value, a variant value, or a missing value is stored. As shown, no information is “not stored.” With regard to genomic information and other large data sets, the vastness of the storage need creates a burden for physical storage as well as for the computational resources needed to access the data.

A “reference value” in the above example is a nucleotide at a given position in the Human Genome Reference Sequence. As is understood, sequence data is often compared to a reference genome, e.g., like puzzle pieces to the picture on the box. Reference values can be stored in the same database, in linking databases, or in another location. A unique key may be used to link the reference values to the experimental positions. For example, in genetic sequence data, a unique key can be constructed from the chromosome, position, and reference version. To uniquely identify a variant, the model can use chromosome, position, reference base, and variant base. A “variant” is any value that is different from the reference at a given position.
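The two keys described above can be sketched as small tuple types. This is only an illustration: the class names `PositionKey` and `VariantKey`, the field names, and the sample values are assumptions made for this sketch and do not come from the disclosure.

```python
from typing import NamedTuple


class PositionKey(NamedTuple):
    """Hypothetical key linking a reference value to an experimental position."""
    chromosome: str
    position: int
    reference_version: str


class VariantKey(NamedTuple):
    """Hypothetical key that uniquely identifies a variant."""
    chromosome: str
    position: int
    reference_base: str
    variant_base: str


# Illustrative values only; they do not describe a real genomic position.
reference_values = {PositionKey("chr1", 1000, "v1"): "A"}
variant = VariantKey("chr1", 1000, "A", "G")

# The variant can be linked back to its reference value through the position key.
lookup = PositionKey(variant.chromosome, variant.position, "v1")
linked_reference = reference_values[lookup]
```

Because both keys are hashable tuples, they can serve directly as dictionary keys or as composite keys in a database table.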

An alternative to storing all known information is shown in FIG. 2, which illustrates a strategy to store only variants or differences from a known reference value. In FIG. 2, there are only 10 elements to store; however, as shown, the strategy fails to account for positions that are missing. In particular, the “not stored” positions are wrongly assumed to be reference values. This results in a severe loss of precision, especially when different, non-overlapping genomic tests have been performed. It also increases the risk of false negatives when different subsets are tested.

SUMMARY

The present disclosure describes systems and methods for storing large data sets that reduce storage by storing what is NOT known. The method defines a “positive region” (a universe of possibilities where we can generally assume to have knowledge) for each sample. These regions may differ between samples to account for different tests. Then, the important genetic variants for each sample are stored using a unique key. Finally, areas of missing data (absent or low quality) are also stored. To determine the precise status of a sample at a given position, we first ask whether the position lies within a positive region. If so, the position is queried for either a difference or for missing data. If neither is found, the status is “reference” or no change. If missing data is found, the sample is excluded from further calculations or reporting at that position. Assuming missing data rates of 5-15%, this results in a savings of 85-95% per sample, allowing smaller physical storage and computational resource requirements. Missing data may be stored as “regions” as opposed to individual positions, further reducing storage requirements. Additionally, a negative storage model incentivizes the collection of high quality data up front by upending the storage requirements at the end (more resources required for “less” data, as we store what's missing).

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views:

FIG. 1 illustrates an exemplary storage model where all data is stored;

FIG. 2 illustrates an example storage model where only variants are stored;

FIG. 3 illustrates an example storage model where negative information is stored;

FIGS. 4A and 4B illustrate example operational flows to encode or compress information in the storage model of FIG. 3;

FIG. 5 illustrates querying the storage model of FIG. 3;

FIG. 6 illustrates an example operational flow to decode or decompress information in the storage model shown in FIG. 5; and

FIG. 7 shows an example computing environment.

DETAILED DESCRIPTION

The present disclosure describes systems and methods for storing large data sets, such as genetic sequence information. Within a “targeted subset” of positions with information, the system of the present disclosure stores both variant states and missing states at each position. Reference states are not stored, but are inferred within the targeted subset when neither a variant nor a missing state is stored at a given position. The above is in contrast to conventional systems where only variants or differences are stored, and the absence of a variant state at a given position is assumed to be a reference state.

In accordance with the present disclosure, criteria for missing data are defined in pre-processing and are customizable based on the use case. For example, each data point may represent the genetic information of a sample at a position in the genome. The targeted subset may represent those positions that were included in a sequencing test.

Queries may be run against the target subset as follows:

-   IF a queried position is NOT in the subset, the system returns a status of “unknown/missing”;
-   IF a queried position is in the subset AND no data is stored for the position, the system returns a status that is inferred to be “reference/non-variant”;
-   IF a queried position is in the subset AND a variant is stored for the position, the system returns a status of “variant”; and
-   IF a queried position is in the subset AND “unknown/missing” is stored for the queried position, the system returns a status of “unknown/missing.”
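The four query rules can be sketched as a single lookup function. This is a minimal illustration rather than the disclosed implementation: the `SampleStore` record, its field names, and the use of in-memory sets and dictionaries in place of a database are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SampleStore:
    """Hypothetical per-sample record in the negative storage model."""
    targeted: Set[int]                                       # positions in the targeted subset
    variants: Dict[int, str] = field(default_factory=dict)   # position -> variant value
    missing: Set[int] = field(default_factory=set)           # positions stored as missing


def query_status(store: SampleStore, position: int) -> str:
    """Return the status of one position, following the four rules above."""
    if position not in store.targeted:
        return "unknown/missing"          # outside the subset: nothing is known
    if position in store.variants:
        return "variant"
    if position in store.missing:
        return "unknown/missing"          # explicitly stored as missing
    return "reference/non-variant"        # inferred, never stored


store = SampleStore(targeted={1, 2, 3}, variants={2: "T"}, missing={3})
statuses = {pos: query_status(store, pos) for pos in (1, 2, 3, 4)}
```

In this example only positions 2 and 3 occupy storage; the status of position 1 is inferred as reference, and position 4 is reported as unknown because it lies outside the targeted subset.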

With reference to FIG. 3, there is illustrated an implementation of a storage model in accordance with the present disclosure. As illustrated, subsets are identified. For example, three different subsets across 7 individuals (rows) may be defined among the example 77 positions of FIG. 3. Other numbers of subsets may be defined in accordance with the sampled data. Within each subset, for each position, data is stored if the value is a “variant” or data is “missing.” If data is not stored, it is assumed to be a predetermined reference value.

For positions outside of the identified subsets, no data is stored, as these positions do not form a part of the sampled data (e.g., genetic information). Thus, the positions outside of the identified subsets form no part of the stored information. As a result, the implementation of the storage model of the present disclosure stores 18 elements among the 77 total possible positions. This provides a higher level of precision than is possible using the storage model of FIG. 2, while reducing the storage requirement from 77 to 18 elements with respect to the storage model of FIG. 1.

In a variation of the storage model of FIG. 3, the data may be stored as regions, whereby a start and end point is identified. For example, contiguous data of like kind may be identified as a region. Thus, the two “stored missing” values in the uppermost subset may be stored as a single region. As such, this variation reduces the storage requirement from 18 elements to 17 elements. Although not shown, if other contiguous data of like kind were stored in the storage model of FIG. 3, such data may also be represented by a region.
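One way to form such regions is to merge runs of adjacent positions into (start, end) pairs. The sketch below assumes integer positions and inclusive end points; the function names `to_regions` and `in_regions` are hypothetical.

```python
from typing import Iterable, List, Tuple


def to_regions(positions: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse contiguous positions into inclusive (start, end) regions."""
    regions: List[Tuple[int, int]] = []
    for pos in sorted(set(positions)):
        if regions and pos == regions[-1][1] + 1:
            # Position directly follows the last region: extend that region.
            regions[-1] = (regions[-1][0], pos)
        else:
            # Gap found: start a new single-position region.
            regions.append((pos, pos))
    return regions


def in_regions(regions: List[Tuple[int, int]], position: int) -> bool:
    """Check whether a position falls inside any stored region."""
    return any(start <= position <= end for start, end in regions)


missing_regions = to_regions([4, 5, 9])   # two adjacent positions and one isolated one
```

Here three missing positions are stored as two regions, mirroring the way two adjacent “stored missing” values collapse into one element in the text.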

In addition to, or as an alternative to, storing elements as regions, bit arrays (integer representations of large sets of binary values) can be used to store information, including but not limited to target subset definitions (which are often applicable to many samples) and missing data definitions (which can be both contiguous and sparse). Bit arrays enable more compact storage of positive data, and enable highly efficient querying via binary math.
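As a rough illustration of querying via binary math, the sketch below uses a Python integer as the bit array, with bit i set when position i belongs to the set. The helper names and the example positions are assumptions for this sketch, and a production system would likely use a dedicated bitmap structure instead.

```python
from typing import Iterable, List


def to_bits(positions: Iterable[int]) -> int:
    """Pack a set of non-negative positions into one integer bit array."""
    bits = 0
    for pos in positions:
        bits |= 1 << pos
    return bits


def positions_of(bits: int) -> List[int]:
    """Unpack a non-negative bit array back into a sorted list of positions."""
    result = []
    pos = 0
    while bits:
        if bits & 1:
            result.append(pos)
        bits >>= 1
        pos += 1
    return result


targeted = to_bits([0, 1, 2, 3, 4])   # targeted subset, shareable across samples
missing = to_bits([3])                # stored missing positions for one sample
variant = to_bits([1])                # stored variant positions for one sample

# Reference positions are inferred with two bitwise operations: in the
# targeted subset, not missing, and not variant.
reference = targeted & ~missing & ~variant
```

Because `targeted` is non-negative, the result of the final expression is also non-negative even though `~missing` and `~variant` are negative integers in Python.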

Encoding or Compressing Data into the Negative Storage Model

FIG. 4A illustrates an operational flow 400 to encode/compress full data 402 (as represented by FIG. 1) into the storage model of FIG. 3. While the operational flow 400 describes a process with particular reference to genomic information, it is contemplated by the present disclosure that other types of data may be stored in the storage model of FIG. 3. Examples of such other data are discussed below. At 404, a targeted subset of known information is defined. For example, the targeted subset may include genomic positions targeted by a clinical genetic test or targeted sequencing experiment, items that are the subject of a survey, or others. The number of targeted subsets is usually smaller than the number of samples/subjects, as many subjects will have the same test/survey. The targeted subsets limit the positions to those in the “universe of possibilities” and are stored in the database system 414 for later use in decoding.

At 406, it is determined for every position or element in the full data 402 whether the position is in the targeted subset 404. If not, the position is not stored, but is inferred to be missing at 408. If so, then at 410, the position is examined for a missing state at 412. The definition of “missing” may vary based on end-user needs and is further described below with reference to FIG. 4B. Missing positions are then stored in database 414. Non-missing positions are tested at 416 to determine if the value is the same as or different from a reference value. Variations from the reference are stored in database 414. Additional information related to the variant state (e.g., the actual value and other related values) is also stored in database 414, or may be linked from another database system. At 418, positions that are the reference value are not stored (discarded), but will be inferred during decoding.
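The encoding flow can be sketched for a single sample as follows. This is a simplified illustration under stated assumptions: the full data and the reference are plain dictionaries keyed by position, a value of `None` (or an absent entry) stands in for whatever the missing test at 412 decides, and the function name `encode` is hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def encode(
    full_data: Dict[int, Optional[str]],
    reference: Dict[int, str],
    targeted: Set[int],
) -> Tuple[Dict[int, str], Set[int]]:
    """Keep only variant and missing positions inside the targeted subset."""
    variants: Dict[int, str] = {}
    missing: Set[int] = set()
    # Positions outside the targeted subset are never visited, so nothing
    # is stored for them (steps 406 and 408).
    for position in targeted:
        value = full_data.get(position)
        if value is None:
            missing.add(position)            # step 412: stored as missing
        elif value != reference[position]:
            variants[position] = value       # step 416: stored as variant
        # Otherwise the value equals the reference and is discarded (step 418).
    return variants, missing


full = {1: "A", 2: "T", 3: None, 9: "G"}
ref = {1: "A", 2: "C", 3: "G", 9: "G"}
stored_variants, stored_missing = encode(full, ref, targeted={1, 2, 3})
```

Of the four input positions, only two are stored: position 2 as a variant and position 3 as missing. Position 1 matches the reference and position 9 lies outside the targeted subset.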

FIG. 4B illustrates an example operational flow to determine missing data. Missing data can be defined in many ways, but is generally defined to be those positions at which there is no information. Missing positions can include positions that were in a targeted subset, but did not have any data (e.g., a failed position in a genetic test, unanswered question in a survey), positions where the data quality was insufficient for distinguishing reference from variant values based on user criteria, or even positions that may have sufficient data, but for other reasons are desired to be masked.

The operational flow of FIG. 4B may be performed on data 410 input to the operational flow 400 that has passed the targeted subset decision 406 of FIG. 4A (data 422 in FIG. 4B). A genetic sequence example 420 is shown, where a position is first assessed for any data at 424, and then assessed for data of a specific quality at 426. Positions without data, or with data of insufficient quality, are stored in database 414, while positions with data of sufficient quality are passed after 426 to process 416 for evaluation of variance. In a survey example 430, the input data are different breakfast items asked about in a questionnaire. For each item, at 432, it is first determined whether an individual answered the question at 434. Positions without data, or with data of insufficient quality, are stored in database 414. If an answer is given, it is determined at 436 whether the answer was allowable (e.g., “blue” is not an allowable answer for a yes/no question, resulting in a missing value to be stored). The positions with data of sufficient quality are passed at 436 to process 416 for evaluation of variance.

Thus, from the examples 420 and 430, the definition of the “missing” state is flexible and can either be defined by a distributor, or made customizable by users. The flexibility enables the establishment of different quality tiers by storing a value with the ‘missing’ state. For example, genetic sequence data may have more strict requirements for clinical use compared to research use. Therefore, during the test for missing (412), two separate processes may be run, e.g., one very strict for clinical use, one more lenient for research use. Positions determined to be missing would include a value indicating the process that determined the missing state (e.g., ‘clinical’, ‘research’, or both). Any number of different missing tiers can be defined by values stored with, or linked to, the missing state.
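A tiered missing test might look like the sketch below. The numeric thresholds, the single quality score per position, and the name `missing_tiers` are all assumptions made for illustration; the disclosure does not specify how quality is measured.

```python
from typing import Dict, Optional, Set

# Hypothetical minimum quality scores per tier; the numbers are illustrative.
TIER_THRESHOLDS: Dict[str, int] = {"clinical": 30, "research": 10}


def missing_tiers(
    quality: Optional[int],
    thresholds: Dict[str, int] = TIER_THRESHOLDS,
) -> Set[str]:
    """Return the tiers for which a position counts as missing."""
    if quality is None:
        # No data at all: missing for every tier.
        return set(thresholds)
    # Missing for each tier whose minimum quality is not reached.
    return {tier for tier, minimum in thresholds.items() if quality < minimum}


no_data = missing_tiers(None)      # missing for both tiers
mid_quality = missing_tiers(20)    # usable for research, missing for clinical use
high_quality = missing_tiers(35)   # not missing for either tier
```

An empty result means the position is passed on to the variant test, while a non-empty result would be stored alongside the missing state so that a query can choose which tier to honor.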

FIG. 5 illustrates querying the storage model of FIG. 3. As shown, queries may be run against positions (e.g., 501, 502 and 503) for each sample. For a query of position 501, seven samples are part of the “known universe” (i.e., the defined subsets include all seven samples of the data set). Here it can be determined that six are not missing and 4 out of 6 contain variant values. Comparing this result with the same query run on the storage model of FIG. 2, the result would be different, as FIG. 2 would inaccurately return a result of 4 out of 7 having variant values. For a query of position 502, four samples are part of the “known universe” (i.e., the defined subsets include four samples of the data set). Here it can be determined that one is not missing and 0 out of 1 have variant values. Comparing this result with the same query run on the storage model of FIG. 2, the incorrect result would be 0 out of 7 having variant values. For a query of position 503, four samples are part of the “known universe” (i.e., the defined subsets include four samples of the data set). Here it can be determined that four are not missing and 3 out of 4 have variant values. Comparing this result with the same query run on the storage model of FIG. 2, the incorrect result would be 3 out of 7 having variant values.
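The difference between the two denominators can be reproduced with a short tally over per-sample statuses. The sketch assumes each sample has already been resolved to one of three labels, with samples outside the targeted subset labeled "missing"; the function names are hypothetical, and the status lists restate the counts given for positions 501 and 502.

```python
from collections import Counter
from typing import List, Tuple


def summarize(statuses: List[str]) -> Tuple[int, int]:
    """Return (variant count, non-missing count) for one position."""
    counts = Counter(statuses)
    called = counts["variant"] + counts["reference"]
    return counts["variant"], called


def summarize_variant_only(statuses: List[str]) -> Tuple[int, int]:
    """Variant-only model of FIG. 2: every non-variant is treated as reference."""
    return Counter(statuses)["variant"], len(statuses)


# Position 501: seven samples in the known universe, one of them stored missing.
position_501 = ["variant"] * 4 + ["reference"] * 2 + ["missing"]
# Position 502: three samples outside any subset, three stored missing, one reference.
position_502 = ["missing"] * 6 + ["reference"]

result_501 = summarize(position_501)
result_502 = summarize(position_502)
naive_501 = summarize_variant_only(position_501)
```

The negative storage model reports 4 of 6 for position 501 and 0 of 1 for position 502, whereas the variant-only tally would report 4 of 7 for position 501 because it cannot exclude the missing sample.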

Thus, as would be understood, the storage model of the present disclosure provides for high accuracy while reducing storage requirements for large data sets of information.

Decoding or Decompressing Data from the Negative Storage Model

FIG. 6 illustrates an example operational flow 600 to query and decode/decompress information in the storage model shown in FIG. 3. At 602, a position of interest to be queried from the storage model as stored in database 414 is identified. At 604, it is determined if the position is in the subset(s) associated with entries in the storage model. If no, then at 606, the user is notified that there is no data for the requested position of interest for a given sample or individual, or the position is ignored for further summarization or querying. If yes, then at 604, the position is in a subset(s) associated with the storage model, and therefore is within the universe of possible knowledge.

Next, at 608, it is determined if a variant is present. If so, then at 610 the method returns an indication that the sample has a variant. Further information associated with the variant position can be returned at 612. If not, then at 614, it is determined if the position is “stored missing.” If not, then at 616, it is reported that the sample at the position of interest has a reference value. If so at 614, then at 618, it is reported that there is missing data for the position of interest, and it is then ignored.

Thus, in accordance with the operational flow 600, the returned states at each position (variant, reference, or missing) are used to reconstitute the original full data 402, thus preserving the original data with greatly reduced storage requirements.
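Reconstitution for one sample can be sketched as a pass over the targeted subset that applies the decisions at 608 and 614 in order. The dictionary and set inputs, the use of `None` to represent a missing value, and the function name `decode` are assumptions of this sketch.

```python
from typing import Dict, Optional, Set


def decode(
    targeted: Set[int],
    variants: Dict[int, str],
    missing: Set[int],
    reference: Dict[int, str],
) -> Dict[int, Optional[str]]:
    """Rebuild the per-position values of one sample inside its targeted subset."""
    full: Dict[int, Optional[str]] = {}
    for position in sorted(targeted):
        if position in variants:
            full[position] = variants[position]    # step 610: variant returned
        elif position in missing:
            full[position] = None                  # step 618: missing data
        else:
            full[position] = reference[position]   # step 616: reference inferred
    return full


ref = {1: "A", 2: "C", 3: "G"}
decoded = decode(targeted={1, 2, 3}, variants={2: "T"}, missing={3}, reference=ref)
```

Positions outside the targeted subset do not appear in the output at all, which corresponds to the "no data" branch at 606.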

While all the above examples of storing and querying data have been demonstrated with respect to genomic information, the storage models of the present disclosure may be used for other types of large data sets. Other uses include, but are not limited to, healthcare clinical pathways (describing a decision tree for “default” procedures to be performed on a patient). The target/realm of possibilities may be those actions/decisions appropriate for a disease. A non-standard outcome/result of a given procedure may be stored as a “variant.” Pathway procedures that have not been performed may be “stored missing” and may be deleted as completed according to the pathway. Similarly, in healthcare, the storage model may be used in a patient encounter database. The target/realm of possibilities may be procedures, forms, questions (i.e., everything that needs to happen with a patient). The outcome/result of a given procedure, or status of a form or questionnaire may be stored as a “variant.” Procedures/forms that have not been performed/returned may be stored as “missing data.”

In other possibilities, the storage model may be used for healthcare or other survey research as part of storing completion of different surveys. The target/realm of possibilities may include those surveys a person is eligible to complete, which are stored as “variants.” Surveys not completed for further follow-up may be stored as “missing data.” Upon completion, the “missing data” may be deleted (assume no data means completed survey).

In yet other possibilities, the storage model may be used in market research, where opinions of people in relation to businesses are stored. The target/realm of possibilities may include a list of businesses within a certain distance or that are “liked” in a social network. The opinion on visited businesses may be stored as “variants.” Information on businesses not visited or “like” links not clicked may be stored as “missing data.” Assume that other businesses were visited, links were clicked, but no opinion was provided (this could also be inverted to assume businesses that are NOT visited, and store those that are visited).

Data on polls/questionnaires can also be stored with this model. The reference value for a given question can be the most popular response, which is then inferred. The targeted subset includes the questions a given respondent was asked (based on the version of the poll/questionnaire). Reference values (the popular answer) would not be stored, but inferred upon decoding. Unanswered questions, responses of “I don't know”, “Not applicable”, etc. would be stored as missing, while the less frequent responses would be stored as variant. Coding and decoding would proceed as in FIGS. 4A, 4B and 6.
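For a single poll question, this encoding might be sketched as follows. The respondent identifiers, the function name `encode_question`, and the choice to compute the most popular answer from the same batch of responses are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple

# Responses treated as missing, taken from the examples in the text.
MISSING_RESPONSES = {"I don't know", "Not applicable"}


def encode_question(
    responses: Dict[str, Optional[str]],
) -> Tuple[Optional[str], Dict[str, str], Set[str]]:
    """Encode one question as (reference answer, variant answers, missing respondents)."""
    usable = {
        who: answer
        for who, answer in responses.items()
        if answer is not None and answer not in MISSING_RESPONSES
    }
    # The most popular usable answer becomes the inferred reference value.
    reference = Counter(usable.values()).most_common(1)[0][0] if usable else None
    variants = {who: answer for who, answer in usable.items() if answer != reference}
    missing = set(responses) - set(usable)
    return reference, variants, missing


responses = {"r1": "yes", "r2": "yes", "r3": "no", "r4": None, "r5": "I don't know"}
reference_answer, variant_answers, missing_respondents = encode_question(responses)
```

Only three of the five responses are stored here: one variant and two missing entries. The two "yes" answers are inferred from the reference on decoding.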

Preliminary Performance Results

The negative storage model of the present disclosure has been implemented using a publicly available genetic sequence dataset (The Cancer Genome Atlas: TCGA). Somatic mutation data covering ~40 million positions was encoded across 367 samples in the described Negative Storage Model. The data was also loaded to a full storage model (as in FIG. 1). Queries were submitted to both systems, and identical results were returned, demonstrating the ability of the Negative Storage Model to reconstitute the full data.

Compared to the full storage model, the number of rows stored in the database for the Negative Storage Model was 0.9% (i.e., a savings of >99%). Loading the data to the Negative Storage Model took between 0.1% and 1% of the time needed for the full storage model, depending on the database system used. Querying time was generally faster for the Negative Storage Model, with query times ranging from 100% (the same) to 1% of that needed by the full storage model. The actual disk space taken by the Negative Storage Model ranged from 1% to 2.5% of that taken by the full storage model. Although these results are not meant to warrant or guarantee any specific level of performance, they demonstrate the Negative Storage Model is 1) effective, 2) able to be reduced to practice, and 3) a significant improvement over existing practice.

Cost Advantages

Common practice is to store what is known. Therefore, the more known information, the higher the storage costs. The Negative Storage Model includes storing what is NOT known. In encoding, known information (reference) is not actually stored, but inferred during decoding. Therefore, this storage model may result in lower storage costs when more information is known: newly known reference values would be inferred, and not stored as “missing” values.

FIG. 7 shows an exemplary computing environment in which example implementations and aspects may be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality.

Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers (PCs), server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.

Computer-executable instructions, such as program modules, being executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.

An exemplary system for implementing aspects described herein includes a computing device, such as computing device 700. In its most basic configuration, computing device 700 typically includes at least one processing unit 702 and memory 704. Depending on the exact configuration and type of computing device, memory 704 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in FIG. 7 by dashed line 706.

Computing device 700 may have additional features/functionality. For example, computing device 700 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in FIG. 7 by removable storage 708 and non-removable storage 710.

Computing device 700 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by device 700 and include both volatile and non-volatile media, and removable and non-removable media.

Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Memory 704, removable storage 708, and non-removable storage 710 are all examples of computer storage media. Although not shown, the computer storage may include network attached storage where another computer system acts as a storage device for the computer system 700. Computer storage media include, but are not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 700. Any such computer storage media may be part of computing device 700.

Computing device 700 may contain communications connection(s) 712 that allow the device to communicate with other devices. Computing device 700 may also have input device(s) 714 such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 716 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.

It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the processes and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium where, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter.

Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include PCs, network servers, and handheld devices, for example.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

1. (canceled)
2. A method of querying and decompressing a negative storage model stored in a database, comprising: receiving a query associated with a position stored in the database in the negative storage model; determining if the position is in a targeted subset of a full dataset, and if so; determining if the position is a variant and returning the variant to the full dataset; determining if the position is missing, and if so, returning a value of missing to the full dataset; determining if the position is neither variant nor missing, and if so, inferring that the position is a reference value and returning the reference value to the full dataset; and determining if the position is not in the targeted subset of the full dataset, and if so, inferring that the position is the reference value and returning the reference value to the full dataset.
3. The method of claim 2, wherein the full dataset is reconstituted from the subset, variant, and missing states only.
4. The method of claim 2, wherein the negative storage model stored in the database has a size that is smaller than a size of the full dataset.
5. The method of claim 2, further comprising retrieving additional information from the database related to the variation in the database.
6. The method of claim 5, wherein the additional information comprises the actual value of the variation.
7. The method of claim 2, wherein a missing value is user defined for the targeted subset.
8. The method of claim 7, further comprising determining plural missing values.
9. A database storage apparatus for querying a compressed negative storage model, comprising: a processor; a memory that contains computer executable instructions that when executed by the processor causes the database storage apparatus to: receive a query associated with a position stored in the database in the negative storage model; determine if the position is in a targeted subset of a full dataset, and if so; determine if the position is a variant and returning the variant to the full dataset; determine if the position is missing, and if so, returning a value of missing to the full dataset; determine if the position is neither variant nor missing, and if so, inferring that the position is a reference value and returning the reference value to the full dataset; and determine if the position is not in the targeted subset of the full dataset, and if so, inferring that the position is the reference value and returning the reference value to the full dataset.
10. The database storage apparatus of claim 2, wherein the full dataset is reconstituted from the subset, variant, and missing states only.
11. The database storage apparatus of claim 2, wherein the negative storage model stored in the database has a size that is smaller than a size of the full dataset.
12. The database storage apparatus of claim 2, wherein additional information is retrievable from the database that is related to the variation in the database.
13. The database storage apparatus of claim 5, wherein the additional information comprises the actual value of the variation.
14. The database storage apparatus of claim 2, wherein a missing value is user defined for the targeted subset.
15. The database storage apparatus of claim 7, wherein plural missing values are retrievable from the database.